
fcoll/vulcan accelerator support #12678

Conversation

edgargabriel (Member)

No description provided.

@edgargabriel changed the title from "Topic/fcoll vulcan accelerator support" to "fcoll/vulcan accelerator support" on Jul 13, 2024
@edgargabriel force-pushed the topic/fcoll-vulcan-accelerator-support branch from 74a3029 to 8b24867 on July 18, 2024 at 19:06
@edgargabriel force-pushed the topic/fcoll-vulcan-accelerator-support branch 2 times, most recently from 0beeef3 to f8bc3fd on August 12, 2024 at 21:04
edgargabriel and others added 3 commits September 3, 2024 06:26
If the user input buffers are GPU device memory, also use GPU device memory for the aggregation step. This allows the data transfers to occur between GPU buffers and hence take advantage of the much higher bandwidth of GPU-GPU interconnects (e.g. XGMI, NVLink).

The downside of this approach is that we cannot call the fbtl ipwritev routine directly, but have to go through the common_ompio_file_iwrite_pregen routine, which performs the necessary segmenting and staging through host memory.

Signed-off-by: Edgar Gabriel <[email protected]>
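
For illustration, a minimal, self-contained C sketch of the segmenting-and-staging idea described in this commit message. This is not the OMPIO code: `device_to_host_copy` stands in for an accelerator memcpy (e.g. hipMemcpy), and POSIX `pwrite` stands in for the fbtl write path.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define STAGING_SIZE (4 * 1024 * 1024)   /* size of the host bounce buffer */

/* Stand-in for an accelerator device-to-host copy (e.g. hipMemcpy). */
static void device_to_host_copy(void *host_dst, const void *dev_src, size_t len)
{
    memcpy(host_dst, dev_src, len);
}

/* Write 'len' bytes of a (device-resident) aggregation buffer to 'fd' starting
 * at the precomputed file offset 'file_off', segmenting the data and staging
 * each segment through host memory -- the basic idea behind the pregen path. */
static int staged_write(int fd, const void *dev_buf, size_t len, off_t file_off)
{
    char *staging = malloc(STAGING_SIZE);
    if (NULL == staging) {
        return -1;
    }
    for (size_t done = 0; done < len; ) {
        size_t chunk = (len - done < STAGING_SIZE) ? len - done : STAGING_SIZE;
        device_to_host_copy(staging, (const char *)dev_buf + done, chunk);
        if (pwrite(fd, staging, chunk, file_off + (off_t)done) != (ssize_t)chunk) {
            free(staging);
            return -1;
        }
        done += chunk;
    }
    free(staging);
    return 0;
}

int main(void)
{
    int fd = open("out.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) {
        return 1;
    }
    size_t len = 16 * 1024 * 1024;
    char *buf = malloc(len);          /* pretend this is GPU device memory */
    if (NULL == buf) {
        close(fd);
        return 1;
    }
    memset(buf, 'x', len);
    int rc = staged_write(fd, buf, len, 0);
    free(buf);
    close(fd);
    return (0 == rc) ? 0 : 1;
}
```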
Add support for using accelerator buffers in the aggregation step of the read_all operation.
This lives in common/ompio rather than in the fcoll components, since all fcoll components (except individual)
currently use the default implementation, which was moved to common/ompio a while back to
avoid code duplication.

Signed-off-by: Edgar Gabriel <[email protected]>
Performance measurements indicate that in most cases using a CPU host
buffer for data aggregation leads to better performance than using a
GPU buffer, so turn the feature off by default.

Signed-off-by: Edgar Gabriel <[email protected]>
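
For illustration, a hypothetical sketch of how such a default-off switch is typically registered in Open MPI via mca_base_component_var_register. The variable name use_accelerator_buffers and the helper function below are made up for this example and may not match the parameter actually added by this PR; check ompi_info for the real name.

```c
#include "opal/mca/base/mca_base_var.h"

/* Hypothetical control variable: 0 = aggregate in host memory (default),
 * 1 = aggregate in accelerator memory. */
static int mca_common_ompio_use_accelerator_buffers = 0;

/* Illustrative registration helper; in a real component this would live in
 * the component's register/open function and use that component's descriptor. */
static void register_accelerator_aggregation_param(const mca_base_component_t *component)
{
    (void) mca_base_component_var_register(component,
                                           "use_accelerator_buffers",  /* hypothetical name */
                                           "If set to 1, use accelerator (GPU) memory for the "
                                           "data aggregation buffers in collective I/O. "
                                           "Off by default, since host buffers are usually faster.",
                                           MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
                                           OPAL_INFO_LVL_9,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &mca_common_ompio_use_accelerator_buffers);
}
```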
@edgargabriel force-pushed the topic/fcoll-vulcan-accelerator-support branch from f8bc3fd to d30471c on September 3, 2024 at 13:26
@qkoziol (Contributor) commented Sep 3, 2024

Looks fine to me.

I guess my only question is why the mca_common_ompio_file_iread_pregen / mca_common_ompio_file_iwrite_pregen routines are necessary, if the code was previously calling the preadv function.

@qkoziol (Contributor) commented Sep 3, 2024

Ah, this is for the ipreadv function.

So, why not use it for CPU memory buffers also?

@edgargabriel (Member, Author) commented Sep 4, 2024

@qkoziol thank you for your review! Let me try to answer your question, and also use this as an opportunity to document some of the changes. The pipeline protocol is used for individual I/O in cases where we need an additional staging buffer for the operation, e.g. for GPU buffers or when we need to perform data conversion for a different data representation. Regular file_read/write operations don't need this additional staging step.

When aggregating data into GPU buffers in collective I/O, we therefore cannot simply call the fbtl/ipreadv or fbtl/ipwritev function (as we do for host buffers), but have to invoke the pipeline protocol. However, in contrast to the individual I/O operations, some of its steps are not necessary: we can reuse the pre-calculated offsets from the collective I/O operation (and hence don't need to repeat the file-view processing), and we don't need to update the file pointer position (that is also done by the collective I/O operation). Hence the two iread_pregen/iwrite_pregen functions.
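
For illustration, a minimal, self-contained C sketch of the calling pattern just described. The types and function pointers are stand-ins rather than the actual OMPIO interfaces; the point is only that the aggregator already holds the (offset, length) segments from the collective phase, so this path neither re-evaluates the file view nor updates the individual file pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* One aggregation segment as pre-calculated by the collective phase:
 * the file offset and length are already known at this point. */
typedef struct {
    off_t  offset;
    size_t length;
} agg_segment_t;

/* Illustrative stand-ins for the two write paths: a direct (fbtl-style) write
 * from a host buffer, and a staged (pregen-style) write that pipes a device
 * buffer through a host bounce buffer segment by segment. */
typedef int (*direct_write_fn_t)(int fd, const void *host_buf, off_t off, size_t len);
typedef int (*staged_write_fn_t)(int fd, const void *dev_buf, off_t off, size_t len);

/* Flush the aggregation buffer using the pre-calculated segments.  Because the
 * offsets come from the collective phase, there is no file-view processing and
 * no individual-file-pointer update in this path. */
static int flush_aggregation_buffer(int fd, const void *agg_buf, bool buf_on_device,
                                    const agg_segment_t *segs, size_t nsegs,
                                    direct_write_fn_t direct_write,
                                    staged_write_fn_t staged_write)
{
    size_t consumed = 0;              /* position inside the aggregation buffer */
    for (size_t i = 0; i < nsegs; i++) {
        const void *src = (const char *)agg_buf + consumed;
        int rc = buf_on_device
                     ? staged_write(fd, src, segs[i].offset, segs[i].length)
                     : direct_write(fd, src, segs[i].offset, segs[i].length);
        if (0 != rc) {
            return rc;
        }
        consumed += segs[i].length;
    }
    return 0;
}
```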

Lastly, each collective component has its own write_all operation, but they all use the same algorithm for read_all, which is why it was moved from the components into the common/ompio directory. This may have to change in the near future, but our focus in the past has always been on the write_all operations, and read_all was neglected a bit.

@edgargabriel merged commit 1afb524 into open-mpi:main on Sep 4, 2024
14 checks passed